1 Unigrams

  • The columns SV_ppb and NSV_ppb are the list-specific frequencies of the unigrams (so the sum of each column should be 100).
  • For a given row, SV_ratio is SV / (SV + NSV).

1.1 Onsets

1.2 Medials

1.3 Nuclei

1.4 Codas

1.5 Tones, live syllables

1.6 Tones, dead syllables

2 Bigrams

Notes

  • All segments are treated as “positionally specific”. That is, final -k and onset k are not the same k for purposes of determining unigram frequencies (and therefore pointwise mutual information). This is partly because what we are interested in is positional stickiness, and partly because they are arguably different (phonetic) segments.
  • Hover over a cell in the heatmaps to see the exact count of bigrams for that cell.
  • In the heatmaps only, bigrams with n=1 are not shown.

Key to tables

  • n_SV and n_NSV are the raw counts of this bigram pair in the relevant list
  • PMI_SV and PMI_NSV are the pointwise mutual information scores for the pair in the relevant list. PMI describes the increase or decrease in the cost of describing a segment in a particular environment (here, under a bigram model). Positive PMI for a sequence AB in list L means that B follows A more often than the separate frequencies of A and B would predict, so observing A makes B less surprising; negative PMI means that B follows A less often than predicted, so we are more surprised to see B, given that we’ve seen A. PMI is green when it exceeds 0.25 and red when it is less than -0.25.
  • SV_ratio is n_SV / (n_SV + n_NSV).
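The quantities in this key can be illustrated with a minimal sketch. It assumes the standard definition PMI(A, B) = log2(p(A, B) / (p(A) p(B))), with probabilities estimated from counts within a single list; the base-2 logarithm, the function names, and the toy onset-nucleus pairs are assumptions made for illustration, not details taken from the analysis itself. Counting the first and second slots separately is what makes the unigram frequencies positionally specific.

```python
import math
from collections import Counter


def pmi_table(pairs):
    """Return {(a, b): PMI} for a list of (first_slot, second_slot) pairs.

    Unigram counts are kept per slot, so a segment in the first slot is
    never pooled with the same symbol in the second slot.
    """
    n = len(pairs)
    pair_counts = Counter(pairs)
    first_counts = Counter(a for a, _ in pairs)
    second_counts = Counter(b for _, b in pairs)
    return {
        (a, b): math.log2((c / n) / ((first_counts[a] / n) * (second_counts[b] / n)))
        for (a, b), c in pair_counts.items()
    }


def sv_ratio(n_sv, n_nsv):
    """Share of a bigram's occurrences that come from the SV list."""
    return n_sv / (n_sv + n_nsv)


# Hypothetical toy list of onset-nucleus pairs (not real data).
toy = [("k", "a"), ("k", "a"), ("t", "i"), ("t", "a")]
pmi = pmi_table(toy)
for pair, score in sorted(pmi.items()):
    print(pair, round(score, 3))
print(sv_ratio(3, 1))
```

In the toy list, ("t", "i") has a PMI of exactly 1 because i only ever follows t, while ("t", "a") is negative because a follows t less often than the overall frequencies of t and a would predict.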

2.1 Onset-nucleus

2.2 Onset-medial

2.3 Onset-coda

2.4 Onset-tone

2.5 Nucleus-coda

2.6 Coda-tone

3 Trigrams

3.1 Onset, medial, nucleus

3.2 Onset, medial, coda

3.3 Medial, nucleus, coda

3.4 Nucleus, coda, tone

4 Syllable structure

4.1 Possible and attested syllables

  • possible is the count of possible syllables of this shape. What counts as a “possible” syllable can be defined in different ways; here we assume:

    • 24 “plain” onsets (including ʔ but excluding w; we distinguish orthographic d gi in addition to s x)
    • 12 nuclei [aː e əː ɛ i ɨ ɔ o u iə ɨə uə] with unrestricted distribution following plain onsets
    • 2 nuclei [a ə] that cannot occur in open syllables
    • 17 “labializable” onsets [ɗw tw tʰw sw zw lw rw cw ʂw ɲw ʈw kw xw ɣw ŋw hw w] (we treat w here like a labialized ʔw for co-occurrence reasons) which may not be followed by [ɨ ɔ o u ɨə uə] (ostensibly the single exception is quốc but it is typically pronounced [kwək])
    • 3 nasal codas [m n ŋ] and 3 unreleased plosive codas [p t k]
    • 2 semivowels [w j] with restricted distribution: [j] cannot follow [i iə e ɛ] and [w] cannot follow [əː ɔ o u uə]
    • a “null” coda that can only occur with 12 of the 14 nuclei
    • 6 tones that can occur with sonorant or null finals
    • 2 tones that can occur with obstruent codas
  • SV and NSV are the counts of syllables of these shapes in the SV and NSV lists, respectively

  • pct_SV_shape and pct_NSV_shape are the percentages of the possible number of syllables of this shape that occur in the SV or NSV lists, respectively. pct_poss_shape is simply the sum of pct_SV_shape and pct_NSV_shape.

  • pct_poss_total is the sum of the SV and NSV counts for this shape, divided by the total of the possible column (17,526).
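The total of 17,526 can be reproduced by enumerating the assumptions listed under possible. The sketch below is one way to do this; the variable and function names are hypothetical, and the onsets are represented only by their counts (24 plain, 17 labializable) since the restrictions on them depend only on which class an onset belongs to.

```python
# Nuclei, codas, and restrictions copied from the assumptions listed above.
NUCLEI_FREE = ["aː", "e", "əː", "ɛ", "i", "ɨ", "ɔ", "o", "u", "iə", "ɨə", "uə"]
NUCLEI_CLOSED_ONLY = ["a", "ə"]          # cannot occur in open syllables
NOT_AFTER_LABIALIZED = {"ɨ", "ɔ", "o", "u", "ɨə", "uə"}
NASALS = ["m", "n", "ŋ"]
STOPS = ["p", "t", "k"]
NOT_BEFORE_J = {"i", "iə", "e", "ɛ"}
NOT_BEFORE_W = {"əː", "ɔ", "o", "u", "uə"}

N_PLAIN_ONSETS = 24
N_LABIALIZED_ONSETS = 17
TONES_SONORANT_OR_NULL = 6
TONES_OBSTRUENT = 2


def count_rimes(labialized):
    """Count nucleus + coda + tone combinations available after one onset."""
    total = 0
    for nucleus in NUCLEI_FREE + NUCLEI_CLOSED_ONLY:
        if labialized and nucleus in NOT_AFTER_LABIALIZED:
            continue
        for coda in [None] + NASALS + STOPS + ["j", "w"]:
            if coda is None and nucleus in NUCLEI_CLOSED_ONLY:
                continue
            if coda == "j" and nucleus in NOT_BEFORE_J:
                continue
            if coda == "w" and nucleus in NOT_BEFORE_W:
                continue
            total += TONES_OBSTRUENT if coda in STOPS else TONES_SONORANT_OR_NULL
    return total


plain = N_PLAIN_ONSETS * count_rimes(labialized=False)
labialized = N_LABIALIZED_ONSETS * count_rimes(labialized=True)
total = plain + labialized
print(plain, labialized, total)  # 12528 4998 17526
```

Under these assumptions, syllables with labialized onsets account for 4,998 of the 17,526 possible syllables, or about 28.5%.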

Takeaways:

  • Out of about 17,500 possible syllables, roughly half are attested, and of that half, about 25% are SV
  • “Possible” syllables are extremely unevenly distributed. Out of all possible CV sequences (including tones), nearly all are attested, while only about half of all possible CVN sequences are.
  • Only around 5% of attested syllables have a Cw- onset, compared to 30% of possible syllables (as calculated here). Thus, it may be more accurate to state that, generally speaking, Vietnamese makes use of almost the entirety of the space of possible CV syllables, but only about half the possible space of C(C)VC syllables.

4.2 Canonical syllable shape

Trần & Vallée 2009 report that “the prevalent monosyllabic pattern in Vietnamese…was the CVC syllable type, respectively 70% and 34% of the monosyllabic words, and respectively 70% and 20% of the language syllable inventory” (2009:232). Their counts were derived from a list of words with frequency above 2% in a 5,000 word lexicon. If we collapse the above table into their three categories (CV, CVC, CCVC), we see the numbers are quite close: about 21% C(C)V, 71% CVC and 8% CCVC.